Modeling the diversity and log-normality of data

Authors

  • Khoat Than
  • Tu Bao Ho
Abstract

We investigate two important properties of real data: diversity and log-normality. Log-normality refers to the fact that data follow a lognormal distribution, whereas diversity measures the variation of the attributes in the data. To our knowledge, these two inherent properties have received little attention from the machine learning community, and from the topic modeling community in particular. In this article, we fill this gap within the framework of topic modeling. We first investigate whether these two properties can be captured by the best-known topic model, Latent Dirichlet Allocation (LDA), and find that LDA behaves inconsistently with respect to diversity: it favors data of low diversity but performs poorly on data of high diversity. We then argue that both properties can be captured well by endowing the topic-word distributions in LDA with the lognormal distribution. This treatment leads to a new model, named the Dirichlet-lognormal topic model (DLN). Using the lognormal distribution makes learning and inference in DLN more complicated than in LDA. Hence, we use a variational method in which model learning and inference are reduced to solving convex optimization problems. Extensive experiments strongly suggest that (1) the predictive power of DLN is consistent with respect to diversity, and (2) DLN works consistently better than LDA on datasets whose diversity is large and on datasets that contain many log-normally distributed attributes. Justifying these results requires insight into the statistical distributions used, which is discussed in the article.
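The idea of replacing LDA's topic-word distributions with lognormal ones can be illustrated with a small generative sketch. The code below is not the authors' DLN specification: it assumes a diagonal lognormal (independent per-word parameters `mu` and `sigma`) and assumes that the positive lognormal draws are normalised to obtain word probabilities. The function name `sample_dln_like_corpus` and all parameter names are hypothetical.

```python
import numpy as np

def sample_dln_like_corpus(n_docs, doc_len, alpha, mu, sigma, seed=0):
    """Sample a toy corpus from an LDA-style model with lognormal topics.

    Assumptions (not taken from the article): topic-word weights are drawn
    from independent lognormals and then normalised to sum to one per topic.
    """
    rng = np.random.default_rng(seed)
    n_topics, vocab_size = mu.shape

    # Positive topic-word weights, one row per topic.
    weights = rng.lognormal(mean=mu, sigma=sigma)
    # Normalise each row so it is a probability distribution over words.
    topics = weights / weights.sum(axis=1, keepdims=True)

    docs = []
    for _ in range(n_docs):
        # Per-document topic proportions, as in LDA.
        theta = rng.dirichlet(alpha)
        # Topic assignment for every token, then a word from that topic.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
    return topics, docs

# Toy usage: 3 topics over a vocabulary of 8 words.
topics, docs = sample_dln_like_corpus(
    n_docs=4, doc_len=10, alpha=np.full(3, 0.5), mu=np.zeros((3, 8)), sigma=1.0
)
```

Under these assumptions, a larger `sigma` spreads the word weights within a topic more widely, which is one simple way to explore how variation in the attributes affects the generated documents.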

Similar resources

Evaluation and Application of the Gaussian-Log Gaussian Spatial Model for Robust Bayesian Prediction of Tehran Air Pollution Data

Air pollution is one of the major problems of the Tehran metropolis. Because Tehran is surrounded by the Alborz Mountains on three sides, pollution from car traffic and other sources is trapped in the city and cannot escape without sufficient wind. Carbon monoxide (CO) is one of the most important contributors to air pollution in Tehran. The...

Investigation of Stability and Relationships between Species Diversity Indices and Topographical Factors (Case Study: Ghorkhud Mountainous Rangeland, Northern Khorasan Province, Iran)

One of the main objectives of ecosystem management is to preserve species diversity. Many researchers regard higher species diversity as an indicator of the stability of ecological systems. The aim of this study is to investigate the stability and relationships between topographical factors and diversity indices in the Ghorkhud mountainous rangeland in northern Khorasan province, Iran. For data sampling (2012), l...

Comparing Deterministic and Geostatistical Methods in Spatial Distribution Study of Soil Physical and Chemical Properties in Arid Rangelands (Case Study: Masileh Plain, Qom, Iran)

Accurate knowledge of the spatial distribution of soil physical and chemical properties is needed for suitable management and proper use of rangelands in Masileh plain, Qom, Iran. In the present study, for the spatial modeling of chemical and physical parameters such as sodium (Na), calcium (Ca), soluble potassium (K), magnesium (Mg), Electrical Conductivity (EC), Saturation Percentage (SP%), silt, cla...

Application of spectrum-volume fractal modeling for detection of mineralized zones

The main goal of this research was to detect the different Cu mineralized zones in the Sungun porphyry deposit in NW Iran using Spectrum-Volume (S-V) fractal modeling based on the sub-surface data for this deposit. This operation was carried out on an estimated Cu block model based on a Fast Fourier Transformation (FFT) using C++ and MATLAB programming. The S-V log-log plot was gene...

Underground contour (UGC) mapping using potential field, well log and comparing with seismic interpretation in Lavarestan area

The Coastal Fars gravimetry project in Fars province was carried out to find buried salt domes and to determine the characteristics of faults in this area. The Lavarestan structure was covered by 4203 gravimetry stations in a regular grid of 1000 × 250 m. The depth structural model of this anticline built in previous studies was based on geological evidence and structural geology measurements. In order t...

Journal title:
  • Intell. Data Anal.

Volume 18  Issue 

Pages  -

Publication date 2014